representation change
Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training
Kosson, Atli, Messmer, Bettina, Jaggi, Martin
Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size $\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t$ early in training by using lower values for the learning rate $\eta_t$. In this work we argue that warmup benefits training by keeping the overall size of $\Delta \mathbf{w}_t$ limited, counteracting large initial values of $\mathbf{u}_t$. Focusing on small-scale GPT training with AdamW/Lion, we explore the following question: Why and by which criteria are early updates $\mathbf{u}_t$ too large? We analyze different metrics for the update size including the $\ell_2$-norm, resulting directional change, and impact on the representations of the network, providing a new perspective on warmup. In particular, we find that warmup helps counteract large angular updates as well as a limited critical batch size early in training. Finally, we show that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize $\mathbf{u}_t$ based on the aforementioned metrics.
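As a rough illustration of the quantities named in this abstract, the following sketch computes a linear warmup schedule, the update $\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t$, and two of the update-size metrics mentioned (the $\ell_2$-norm and the angular change of the weights). The function names, the linear shape of the schedule, and the plain-list weight vectors are assumptions made for this example; they are not taken from the paper.

```python
import math

def warmup_lr(step, peak_lr, warmup_steps):
    """Linear warmup (an assumed schedule shape): ramp eta_t up to peak_lr."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def apply_update(w, u, lr):
    """Return w + eta_t * u_t, where u_t is the optimizer's update direction."""
    return [wi + lr * ui for wi, ui in zip(w, u)]

def angular_change(w_old, w_new):
    """Angle in radians between the weights before and after an update."""
    dot = sum(a * b for a, b in zip(w_old, w_new))
    cos = dot / (l2_norm(w_old) * l2_norm(w_new))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error

w = [1.0, 0.0]   # toy weight vector
u = [0.0, 1.0]   # toy update direction, orthogonal to w
for step in (0, 9):
    lr = warmup_lr(step, peak_lr=1.0, warmup_steps=10)
    w_new = apply_update(w, u, lr)
    delta = [b - a for a, b in zip(w, w_new)]
    print(step, lr, l2_norm(delta), math.degrees(angular_change(w, w_new)))
```

In this toy setting a full step at the peak rate rotates the weights by 45 degrees, while the first of ten warmup steps rotates them by roughly 5.7 degrees, which is the sense in which a lower $\eta_t$ limits the angular update.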
Similarity of Pre-trained and Fine-tuned Representations
Goerttler, Thomas, Obermayer, Klaus
Representation similarity analysis shows that the most significant change still occurs in the head even if all weights are updatable. However, recent results from few-shot learning have shown that representation change in the early layers, which are mostly convolutional, is beneficial, especially in the case of cross-domain adaption. In our paper, we find out whether that also holds true for transfer learning. In addition, we analyze the change of representation in transfer learning, both during pre-training and fine-tuning, and find out that pre-trained structure is unlearned if not usable. However, Oh et al. (2021) found out that, especially in the case of cross-domain adaption, where the fine-tuning task does not come from the same distribution as in training, also an adaptation of earlier layers is very beneficial. Neyshabur et al. (2020) investigated what is transferred in transfer learning by shuffling the blocks of inputs. They confirmed that lower layers are responsible for more general features and that a network with pre-trained weights stays in the same basin of solution during fine-tuning. This paper analyses representation obtained by models having
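To make "representation similarity" concrete, here is a minimal sketch of linear centered kernel alignment (CKA), one widely used similarity measure between two sets of layer activations. The excerpt above does not say which measure the authors use, so the choice of linear CKA is an assumption for illustration, as are the function names. Each matrix is a list of rows, one row per input example and one column per feature.

```python
import math

def center_columns(X):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y for matrices with the same row count."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            s = sum(X[k][i] * Y[k][j] for k in range(len(X)))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    Xc, Yc = center_columns(X), center_columns(Y)
    num = cross_frobenius_sq(Xc, Yc)
    den = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return num / den

# Toy activations for three inputs; "after" is a rescaled copy of "before".
before = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
after = [[2.0 * v for v in row] for row in before]
print(linear_cka(before, after))
```

Linear CKA is invariant to isotropic rescaling of the features, so a layer whose activations are merely rescaled by fine-tuning scores as unchanged, while a layer that encodes unrelated structure scores near zero.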
Does MAML Only Work via Feature Re-use? A Data Centric Perspective
Miranda, Brando, Wang, Yu-Xiong, Koyejo, Sanmi
Recent work has suggested that a good embedding is all we need to solve many few-shot learning benchmarks. Furthermore, other work has strongly suggested that Model Agnostic Meta-Learning (MAML) also works via this same method - by learning a good embedding. These observations highlight our lack of understanding of what meta-learning algorithms are doing and when they work. In this work, we provide empirical results that shed some light on how meta-learned MAML representations function. In particular, we identify three interesting properties: 1) In contrast to previous work, we show that it is possible to define a family of synthetic benchmarks that result in a low degree of feature re-use - suggesting that current few-shot learning benchmarks might not have the properties needed for the success of meta-learning algorithms; 2) meta-overfitting occurs when the number of classes (or concepts) is finite, and this issue disappears once the task has an unbounded number of concepts (e.g., online learning); 3) more adaptation at meta-test time with MAML does not necessarily result in a significant representation change or even an improvement in meta-test performance - even when training on our proposed synthetic benchmarks. Finally, we suggest that to understand meta-learning algorithms better, we must go beyond tracking only absolute performance and, in addition, formally quantify the degree of meta-learning and track both metrics together. Reporting results in future work this way will help us identify the sources of meta-overfitting more accurately and help us design more flexible meta-learning algorithms that learn beyond fixed feature re-use. We also conjecture that the core challenge of re-thinking meta-learning is in the design of few-shot learning data sets and benchmarks - rather than in the algorithms, as suggested by previous work.
Features, Projections, and Representation Change for Generalized Planning
Generalized planning is concerned with the characterization and computation of plans that solve many instances at once. In the standard formulation, a generalized plan is a mapping from feature or observation histories into actions, assuming that the instances share a common pool of features and actions. This assumption, however, excludes the standard relational planning domains where actions and objects change across instances. In this work, we extend the standard formulation of generalized planning to such domains. This is achieved by projecting the actions over the features, resulting in a common set of abstract actions which can be tested for soundness and completeness, and which can be used for generating general policies such as "if the gripper is empty, pick the clear block above x and place it on the table" that achieve the goal clear(x) in any Blocksworld instance. In this policy, "pick the clear block above x" is an abstract action that may represent the action Unstack(a, b) in one situation and the action Unstack(b, c) in another. Transformations are also introduced for computing such policies by means of fully observable non-deterministic (FOND) planners. The value of generalized representations for learning general policies is also discussed.
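The Blocksworld policy quoted in this abstract can be sketched as follows. This is a toy illustration of how one abstract action can ground to different concrete actions in different states, not the paper's projection or FOND-based method. The state encoding (a dict mapping each block to its support), the function names, and the `PutDown` action label are assumptions introduced for the example.

```python
def blocks_above(on, x):
    """Blocks stacked above x, bottom to top; `on` maps block -> support."""
    above = []
    current = x
    while True:
        nxt = next((b for b, s in on.items() if s == current), None)
        if nxt is None:
            return above
        above.append(nxt)
        current = nxt

def ground_abstract_pick(on, x):
    """Ground 'pick the clear block above x' as a concrete Unstack action."""
    above = blocks_above(on, x)
    if not above:
        return None  # clear(x) already holds
    top = above[-1]
    return ("Unstack", top, on[top])

def clear_policy(on, x):
    """Apply the general policy until clear(x) holds; return the concrete plan."""
    on = dict(on)  # work on a copy so the caller's state is untouched
    plan = []
    while True:
        action = ground_abstract_pick(on, x)
        if action is None:
            return plan
        _, top, _ = action
        plan.append(action)
        plan.append(("PutDown", top))  # hypothetical label for placing on the table
        on[top] = "table"

# Tower a-on-b-on-c: the same abstract action grounds to two different Unstacks.
tower = {"a": "b", "b": "c", "c": "table"}
for step in clear_policy(tower, "c"):
    print(step)
```

On the three-block tower, the abstract action first grounds to Unstack(a, b) and then to Unstack(b, c), matching the two situations described in the abstract; the same policy function works unchanged on a tower of any height.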
Meta-Search Through the Space of Representations and Heuristics on a Problem by Problem Basis
Fuentetaja, Raquel (Universidad Carlos III de Madrid) | Barley, Michael (University of Auckland) | Borrajo, Daniel (Universidad Carlos III de Madrid) | Douglas, Jordan (University of Auckland) | Franco, Santiago (University of Huddersfield) | Riddle, Patricia (University of Auckland)
Two key aspects of problem solving are representation and search heuristics. Both theoretical and experimental studies have shown that there is no one best problem representation nor one best search heuristic. Therefore, some recent methods, e.g., portfolios, learn a good combination of problem solvers to be used in a given domain or set of domains. There are even dynamic portfolios that select a particular combination of problem solvers specific to a problem. These approaches: (1) need to perform a learning step; (2) do not usually focus on changing the representation of the input domain/problem; and (3) frequently do not adapt the portfolio to the specific problem. This paper describes a meta-reasoning system that searches through the space of combinations of representations and heuristics to find one suitable for optimally solving the specific problem. We show that this approach can be better than selecting a single combination to use for all problems within a domain and is competitive with state-of-the-art optimal planners.
AI and Consciousness: Theoretical Foundations and Current Approaches
The Association for the Advancement of Artificial Intelligence presented the 2007 Fall Symposium Series on Friday through Sunday, November 9-11, at the Westin Arlington Gateway, Arlington, Virginia. The titles of the seven symposia were (1) AI and Consciousness: Theoretical Foundations and Current Approaches, (2) Artificial Intelligence for Prognostics, (3) Cognitive Approaches to Natural Language Processing, (4) Computational Approaches to Representation Change during Learning and Development, (5) Emergent Agents and Socialities: Social and Organizational Aspects of Intelligence, (6) Intelligent Narrative Technologies, and (7) Regarding the "Intelligence" in Distributed Intelligent Systems. Is it possible to build a conscious machine? Is trying to design and build a conscious machine helpful to understanding the nature of consciousness? These questions have been at the core of AI since its beginnings.